1. Abstract

In this paper we propose and study a technique to impose structural constraints on the output
of a neural network, which can reduce the amount of computation and the number of parameters,
besides improving prediction accuracy, when the output is known to approximately conform to
the low-rankness prior. The technique proceeds by replacing the output layer of the neural network
with so-called MLM layers, which force the output to be the result of some Multilinear
Map, like a hybrid-Kronecker-dot product or Kronecker Tensor Product. In particular, given an
“autoencoder” model trained on the SVHN dataset, we can construct a new model with an MLM layer
achieving a 62% reduction in the total number of parameters and a reduction of the $\ell_2$
reconstruction error from 0.088 to 0.004. Further experiments on other autoencoder model variants
trained on the SVHN dataset also demonstrate the efficacy of MLM layers.

2. Introduction 

To the human eye, images made up of random values are typically easily distinguishable from
images of the real world. In terms of Bayesian statistics, a prior distribution can be constructed
to describe the likelihood of an image being a natural image. An early example of such an “image
prior” is related to the frequency spectrum of an image, which assumes that lower-frequency
components are generally more important than high-frequency components, to the extent that
one can discard some high-frequency components when reducing the storage size of an image,
as in the JPEG standard for lossy compression of images (Wallace, 1991).

Another family of image priors is related to the so-called sparsity pattern (Candès et al., 2006,
Candès et al., 2006, Candès et al., 2008), which refers to the phenomenon that real-world data can
often be constructed from a handful of exemplars modulo some negligible noise. In particular,
for data represented as a vector $d \in \mathbb{F}^m$ that exhibits sparsity, we can construct the
following approximation:

$$d \approx Dx, \tag{1}$$

where $D$ is referred to as the dictionary for $d$ and $x$ is the weight vector combining rows of
the dictionary to reconstruct $d$. In this formulation, sparsity is reflected by the phenomenon that
the number of non-zeros in $x$ is often much less than its dimension, namely:

$$\|x\|_0 \ll m. \tag{2}$$

The sparse representation $x$ may be derived in the framework of dictionary learning (Olshausen
and Field, 1997, Mairal et al., 2009) by the following optimization:

$$\min \Big\| M - \sum_i x_i D_i \Big\|_F^2 + \lambda f(x), \tag{3}$$

where $D_i$ is a component in the dictionary and $f$ is used to induce sparsity in $x$, with
$\ell_1$ being a possible choice.

2.1 Low-Rankness as Sparse Structure in Matrices 

When data is represented as a matrix $M$, the formulation of (3) is related to the rank of $M$ by
Theorem ?? in the following sense:

$$\operatorname{rank}(M) = \min_x \|x\|_0 \tag{4}$$

$$\text{s.t. } M = \sum_i x_i D_i, \tag{5}$$

$$D_i = u_i v_i^\top, \quad u_i^\top u_i = 1, \quad v_i^\top v_i = 1. \tag{6}$$


Hence a low rank matrix M always has a sparse representation w.r.t. some rank-1 orthogonal 
bases. This unified view allows us to generalize the sparsity pattern to images by requiring the 
underlying matrix to be low-rank. 

When an image has multiple channels, it may be represented as a tensor $T$ of order 3.
Nevertheless, the matrix structure can still be recovered by unfolding the tensor along
some dimension.

Figure 1 shows that the unfolding of the RGB image tensor along the width dimension
can be well approximated by a low-rank matrix, as energy is concentrated in the first few
components.

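In particular, given the Singular Value Decomposition of a matrix $M = UDV^*$, where $U$, $V$ are
unitary matrices and $D$ is a diagonal matrix with the diagonal made up of the singular values of
$M$, a rank-$k$ approximation of $M \in \mathbb{F}^{m \times n}$ is:

$$M \approx M_k = \tilde{U}\tilde{D}\tilde{V}^*, \quad \tilde{U} \in \mathbb{F}^{m \times k},\ \tilde{D} \in \mathbb{F}^{k \times k},\ \tilde{V} \in \mathbb{F}^{n \times k}, \tag{7}$$

where $\tilde{U}$ and $\tilde{V}$ are the first $k$ columns of $U$ and $V$ respectively, and
$\tilde{D}$ is a diagonal matrix made up of the largest $k$ entries of $D$.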

In this case approximation by SVD is optimal in the sense that the following holds (Horn
and Johnson, 1991):

$$\|M_r - M\|_F = \inf_X \|X - M\|_F \quad \text{s.t. } \operatorname{rank}(X) \le r, \tag{8}$$
[Figure 1: An illustration of the singular values of the unfolding of an RGB sample image.
(a) The unfolding of an RGB sample image. (b) The singular values, plotted against the index of
the singular value.]


and

$$\|M_r - M\|_2 = \inf_X \|X - M\|_2 \quad \text{s.t. } \operatorname{rank}(X) \le r. \tag{9}$$
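The optimality statements (8) and (9) can be checked numerically. The following is a minimal
NumPy sketch, not code from the paper; the helper name `truncated_svd` and the random test matrix
are assumptions made for illustration.

```python
import numpy as np

def truncated_svd(M, k):
    """Return the rank-k approximation of M built from its k largest singular values."""
    U, s, Vh = np.linalg.svd(M, full_matrices=False)
    # Scale the first k left singular vectors by the singular values, then recombine.
    return (U[:, :k] * s[:k]) @ Vh[:k, :]

rng = np.random.default_rng(0)
M = rng.standard_normal((30, 20))
k = 5
M_k = truncated_svd(M, k)
s = np.linalg.svd(M, compute_uv=False)

# Both approximation errors are determined by the discarded singular values.
fro_err = np.linalg.norm(M - M_k, "fro")   # sqrt of the sum of squares of s[k:]
spec_err = np.linalg.norm(M - M_k, 2)      # the largest discarded singular value s[k]
```

The Frobenius error depends on all discarded singular values, while the spectral error depends only
on the largest one, which is why a fast-decaying spectrum as in Figure 1 yields a good approximation.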

An important variation of the SVD-based low-rank approximation is to also model the $\ell_1$
noise in the data:

$$\|M\|_{\mathrm{RPCA}} = \inf_S \|M - S\|_* + \lambda \|S\|_1. \tag{10}$$

This norm, which we tentatively call the “RPCA-norm” (Candès et al., 2011), is well-defined as
it is the infimal convolution (Rockafellar, 2015) of two norms.

2.2 Low Rank Structure for Tensor 

Above, we have used the low-rank structure of the unfolded matrix of an order-3 tensor
corresponding to the RGB values of an image to reflect the low-rank structure of the tensor.
However, there are multiple ways to construct such a matrix. One is to form a matrix
$D \in \mathbb{F}^{m \times 3n}$ by stacking R, G and B together horizontally. A natural question is
what happens if we instead construct a matrix $D' \in \mathbb{F}^{3m \times n}$ by stacking R, G
and B together vertically. Moreover, it would be interesting to know whether there is a method that
can exploit both forms of construction. It turns out that we can deal with the above variation by
enumerating all possible unfoldings. For example, the nuclear norm of a tensor may be defined
as:

Definition 1 Given a tensor $\mathcal{E}$, let $E_{(i)} = \mathrm{fold}_i^{-1}(\mathcal{E})$ be
the unfoldings of $\mathcal{E}$ to matrices. The nuclear norm of $\mathcal{E}$ is defined w.r.t. some
weights $\beta_i$ (satisfying $\beta_i \ge 0$) as a weighted sum of the matrix nuclear norms of the
unfolded matrices $E_{(i)}$ as:

$$\|\mathcal{E}\|_* = \sum_i \beta_i \|E_{(i)}\|_*. \tag{11}$$

Consequently, minimizing the tensor nuclear norm will minimize the matrix nuclear norm of
all unfoldings of the tensor. Further, by adjusting the weights used in the definition of the tensor
nuclear norm, we may selectively minimize the nuclear norm of some unfoldings of the tensor.
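The weighted sum in (11) can be sketched in a few lines of NumPy. This is an illustrative sketch
only: the helper names `unfold` and `tensor_nuclear_norm` are hypothetical, and the unfolding
convention (mode $i$ becomes the rows, all other modes are flattened into the columns) is an
assumption, since the text does not fix one.

```python
import numpy as np

def unfold(T, mode):
    """Unfold tensor T along `mode` into a matrix of shape (T.shape[mode], -1)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tensor_nuclear_norm(T, weights):
    """Weighted sum of the matrix nuclear norms of all mode unfoldings of T."""
    assert len(weights) == T.ndim
    return sum(w * np.linalg.norm(unfold(T, i), "nuc") for i, w in enumerate(weights))

# A rank-1 tensor: every unfolding is a rank-1 matrix with the same single singular value.
a, b, c = np.array([1.0, 2.0]), np.array([1.0, 0.0, 1.0]), np.array([3.0, 4.0])
T = np.einsum("i,j,k->ijk", a, b, c)
value = tensor_nuclear_norm(T, [1.0, 1.0, 1.0])
```

Setting a weight to zero drops the corresponding unfolding from the sum, which is the selective
minimization mentioned above.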
2.3 Stronger Sparsity for Tensor by Kronecker Product Low Rank

When an image is taken as a matrix, the bases for low-rank factorizations are outer products of row
and column vectors. However, if the data exhibits the Principle of Locality, then only adjacent rows
and columns can be meaningfully related, meaning the rank in this decomposition will not be too low.
In contrast, local patches may be used as the basis for low-rank structure (Elad and Aharon, 2006,
Yang et al., 2008, Schaeffer and Osher, 2013, Dong et al., 2014b, Yoon et al., 2014). Further,
patches may be grouped before being assumed to be of low rank (Buades et al., 2005, Hu et al., 2015,
Kwok, 2015).

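A simple method to exploit the low-rank structure of the image patches is based on the Kronecker
Product SVD (KPSVD) of a matrix $M$:

$$M = \sum_{i=1}^{\operatorname{rank}(\mathcal{R}(M))} \sigma_i U_i \otimes V_i, \tag{12}$$

where $\mathcal{R}(\cdot)$ is the operator defined in (Van Loan, 2000, Van Loan et al., 1993) which
shuffles indices:

$$\mathcal{R}(A \otimes B) = \operatorname{vec} A \, (\operatorname{vec} B)^\top. \tag{13}$$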

Note that the outer product is a special case of the Kronecker product when
$A \in \mathbb{F}^{m \times 1}$ and $B \in \mathbb{F}^{1 \times n}$, hence we have SVD as a special
case of KPSVD. The choice of shapes of $A$ and $B$, however, affects the extent to which the
underlying sparsity assumption is valid. Below we give an empirical comparison of KPSVD with SVD.
The image is of width 480 and height 320, and we approximate the image with KPSVD and SVD
respectively for different ranks. To make the results comparable, we let
$B \in \mathbb{F}^{16 \times 20}$ to make the number of parameters in the two approaches equal.
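A KPSVD approximation can be sketched by applying an ordinary SVD to the shuffled matrix. The
code below is a minimal NumPy illustration rather than the implementation used for Figure 2; the
function names `rearrange` and `kpsvd_approx` are hypothetical, and row-major flattening is
assumed for vec (Van Loan uses column-major, which only permutes rows and columns of the
shuffled matrix).

```python
import numpy as np

def rearrange(M, shape_a, shape_b):
    """Shuffle indices so that rearrange(kron(A, B)) == outer(A.ravel(), B.ravel())."""
    (m1, n1), (m2, n2) = shape_a, shape_b
    assert M.shape == (m1 * m2, n1 * n2)
    return M.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)

def kpsvd_approx(M, shape_a, shape_b, rank):
    """Approximate M by a sum of `rank` Kronecker products via an SVD of rearrange(M)."""
    U, s, Vh = np.linalg.svd(rearrange(M, shape_a, shape_b), full_matrices=False)
    approx = np.zeros(M.shape, dtype=float)
    for i in range(rank):
        left = U[:, i].reshape(shape_a)    # left Kronecker factor
        right = Vh[i, :].reshape(shape_b)  # right Kronecker factor
        approx += s[i] * np.kron(left, right)
    return approx

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 24))
B = rng.standard_normal((16, 20))   # right matrix shape used in the text
M = np.kron(A, B)                   # 320 x 480, the image size used in the text
M_1 = kpsvd_approx(M, A.shape, B.shape, rank=1)
```

With these shapes one KPSVD component stores 20·24 + 16·20 = 800 numbers, the same as one SVD
component of a 320 x 480 matrix (320 + 480), which is the parameter parity described above.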

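We may extend the Kronecker product to tensors as the Kronecker Tensor Product:

Definition 2 The Kronecker tensor product (Phan et al., 2013, 2012) is defined for two tensors of
the same order $k$. I.e., for two tensors

$$A \in \mathbb{F}^{m_1 \times m_2 \times \cdots \times m_k} \tag{14}$$

and

$$B \in \mathbb{F}^{n_1 \times n_2 \times \cdots \times n_k}, \tag{15}$$

we define the Kronecker product of tensors, with zero-based indices, as

$$(A \otimes B)_{i_1, i_2, \ldots, i_k} = A_{\lfloor i_1/n_1 \rfloor, \lfloor i_2/n_2 \rfloor, \ldots, \lfloor i_k/n_k \rfloor} \, B_{i_1 \bmod n_1,\, i_2 \bmod n_2,\, \ldots,\, i_k \bmod n_k}, \tag{16}$$

where

$$A \otimes B \in \mathbb{F}^{m_1 n_1 \times m_2 n_2 \times \cdots \times m_k n_k}. \tag{17}$$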
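The Kronecker tensor product can be written with an outer product followed by an axis
interleave. This is a minimal sketch assuming the standard zero-based convention in which entry
$(i_1, \ldots, i_k)$ of the product multiplies the entry of the left factor at
$\lfloor i_t / n_t \rfloor$ with the entry of the right factor at $i_t \bmod n_t$; the function name
`kron_tensor` is hypothetical. NumPy's `np.kron` follows the same convention for N-dimensional
arrays and is used here as a cross-check.

```python
import numpy as np

def kron_tensor(A, B):
    """Kronecker tensor product of two arrays of the same order."""
    assert A.ndim == B.ndim
    k = A.ndim
    # Outer product has axes (a_1..a_k, b_1..b_k); interleave to (a_1, b_1, ..., a_k, b_k).
    outer = np.multiply.outer(A, B)
    order = [axis for pair in zip(range(k), range(k, 2 * k)) for axis in pair]
    out_shape = [m * n for m, n in zip(A.shape, B.shape)]
    return outer.transpose(order).reshape(out_shape)

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 2))
B = rng.standard_normal((3, 2, 4))
P = kron_tensor(A, B)   # shape (2*3, 3*2, 2*4)
```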


With the help of the Kronecker Tensor Product, (Phan et al., 2013, 2012) are able to extend
KPSVD to tensors as:

$$T = \sum_{i=1}^{\operatorname{rank}(\mathcal{R}(T))} \sigma_i U_i \otimes V_i, \tag{18}$$

where $\mathcal{R}(T)$ is a matrix.
[Figure 2: A visual comparison of the results of KPSVD and SVD approximation given the same
number of parameters. (a) Original image selected from the BSD500 dataset (Arbelaez et al., 2011).
(c) KPSVD approximations with right matrix having shape 16x20 and with rank 1, 2, 5, 10, 20
respectively from left to right. For this example, it can be seen that KPSVD with right matrix
shape 16x20 is considerably better than SVD in approximation.]
3. Prediction Regularization by Structural Constraint

In a neural network like

$$\inf_\theta d(Y, f(X; \theta)), \tag{19}$$

if $f(X; \theta)$, the prediction of the network, is known to be like an image, it is desirable to
use image priors to help improve the prediction quality. An example of such a neural network
is the so-called “Autoencoder” (Vincent et al., 2008, Deng et al., 2010, Vincent et al., 2010, Ng,
2011), which, when presented with a possibly corrupted image, will output a reconstructed image
in which the noise is possibly suppressed.

One method to exploit the prior information is to introduce an extra cost term to regularize
the output as

$$\inf_\theta d(Y, Z) + \lambda r(Z) \quad \text{s.t. } Z = f(X; \theta). \tag{20}$$

The regularization technique is well studied in the matrix and tensor completion literature
(Liu et al., 2013, Tomioka et al., 2010, Gandy et al., 2011, Signoretto et al., 2011, Kressner et al.,
2013, Bach et al., 2012). For example, the nuclear norm $\|Z\|_*$, which is the sum of singular
values, or the logarithm of determinant $\log(\det(\epsilon I + ZZ^\top))$ (Fazel et al., 2003), may
be used to induce low-rank structure if $Z$ is a matrix. It is even possible to use the RPCA-norm
to better handle the possible sparse non-low-rank components in $Z$ by letting
$r(Z) = \|Z\|_{\mathrm{RPCA}}$.

However, using extra regularization terms also involves a few subtleties:

1. The training of the neural network incurs the extra cost of computing the regularizer terms,
which slows down training. This impact is exacerbated if the regularization terms cannot be
efficiently computed in batch, as when using the nuclear norm or RPCA-norm regularizers
together with the popular mini-batch based Stochastic Gradient Descent training algorithm
(Bottou, 2010).

2. The value of $\lambda$ is application-specific and may only be found through grid search on a
validation set. For example, when $r$ reflects low-rankness of the prediction, a larger value of
$\lambda$ may induce a result of lower rank, but may cause degradation of the reconstruction quality.
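To make the cost in the first point concrete, the two matrix regularizers mentioned above can be
sketched as follows. This is an illustrative NumPy sketch with hypothetical function names, not
the training code of the paper; it evaluates the penalties only and does not compute gradients.

```python
import numpy as np

def nuclear_norm(Z):
    """Sum of the singular values of a matrix Z."""
    return np.linalg.svd(Z, compute_uv=False).sum()

def logdet_penalty(Z, eps=1e-3):
    """log det(eps * I + Z Z^T), a smooth surrogate for the rank of Z."""
    _, logabsdet = np.linalg.slogdet(eps * np.eye(Z.shape[0]) + Z @ Z.T)
    return logabsdet

def batch_penalty(batch, penalty):
    """Regularizer for a mini-batch: one matrix decomposition per sample."""
    return sum(penalty(Z) for Z in batch)

Z = np.diag([3.0, 4.0])
batch = np.stack([Z, 2.0 * Z])
total = batch_penalty(batch, nuclear_norm)
```

Each sample in the mini-batch needs its own decomposition, so the regularizer does not reduce to
a single large matrix product the way the layers of the network do.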

We take an alternative method by directly restricting the parameter space of the output.
Assume in the original neural network, the output is given by a Fully-Connected (FC) layer as:

$$L_a = h(L_{a-1} M_a + b_a). \tag{21}$$

It can be seen that the output $L_a \in \mathbb{F}^{m \times n}$ has $mn$ free parameters.
However, note that the product of two matrices $A \in \mathbb{F}^{m \times r}$ and
$B \in \mathbb{F}^{r \times n}$ will have the property

$$\operatorname{rank}(AB) \le r, \tag{22}$$

when $m > r$ and $n > r$.

Hence we may enforce that $\operatorname{rank}(L_a) \le r$ by the following construct for example:

$$L_a = h(L_{a-1} M_a + b_a)\, h(L_{a-1} N_a + c_a), \tag{23}$$

when we have:

$$L_{a-1} M_a + b_a \in \mathbb{F}^{m \times r}, \tag{24}$$

$$L_{a-1} N_a + c_a \in \mathbb{F}^{r \times n}. \tag{25}$$
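The construct in (23) can be sketched for a single input vector as below. This is a minimal NumPy
illustration under stated assumptions: the two FC outputs are reshaped into an $m \times r$ and an
$r \times n$ matrix, `tanh` stands in for the unspecified activation $h$, and the name
`low_rank_output` and all sizes are hypothetical.

```python
import numpy as np

def low_rank_output(x, M_a, b_a, N_a, c_a, m, n, r, h=np.tanh):
    """Output an m x n matrix of rank at most r as a product of two FC outputs."""
    left = h(x @ M_a + b_a).reshape(m, r)    # left factor, m x r
    right = h(x @ N_a + c_a).reshape(r, n)   # right factor, r x n
    return left @ right

rng = np.random.default_rng(0)
d, m, n, r = 8, 12, 10, 3
M_a, b_a = rng.standard_normal((d, m * r)), rng.standard_normal(m * r)
N_a, c_a = rng.standard_normal((d, r * n)), rng.standard_normal(r * n)
x = rng.standard_normal(d)
L_a = low_rank_output(x, M_a, b_a, N_a, c_a, m, n, r)

# Parameter counts (weights plus biases) of a plain FC output layer and of the two factors.
fc_params = d * m * n + m * n
low_rank_params = d * (m + n) * r + (m + n) * r
```

The rank bound holds by construction for every input, so no regularization weight has to be tuned.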
In a Convolutional Neural Network where intermediate data are represented as tensors, we
may enforce the low-rank image prior similarly. In fact, by proposing a new kind of output layer
to explicitly encode the low-rankness of the output based on some kind of Multilinear Map, like
a hybrid of Kronecker and dot product, or Kronecker tensor product, we are able to increase the
quality of denoising and reduce the amount of computation at the same time. We outline the two
formulations below.

Assume each output instance of a neural network is an image represented by a tensor of order
3:

$$T \in \mathbb{F}^{C \times H \times W}, \tag{26}$$

where $C$, $H$ and $W$ are the number of channels, height and width of the image respectively.


3.1 KTP layer 

In KTP layers, we approximate $T$ by the Kronecker Tensor Product of two tensors:

$$T \approx A \otimes B. \tag{27}$$

By applying the shuffle operator defined in (Van Loan, 2000, Van Loan et al., 1993), (27) is
equivalent to:

$$\mathcal{R}(T) \approx \operatorname{vec} A \, (\operatorname{vec} B)^\top, \tag{28}$$

hence we are effectively doing a rank-1 approximation of the matrix $\mathcal{R}(T)$. A natural
extension would then be to increase the number of components in the approximation as:

$$T \approx \sum_{i=1}^{K} A_i \otimes B_i. \tag{29}$$

We may further combine the multiple shape formulation of (29) to get the general form of the KTP
layer:

$$T \approx \sum_{j=1}^{J} \sum_{i=1}^{K} A_{ij} \otimes B_{ij}, \tag{30}$$

where, for each $j$, the $A_{ij}$ and the $B_{ij}$ are of the same shape respectively.
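The general KTP form, a sum over several factor shapes with several components each, can be
sketched with `np.kron`, which handles N-dimensional arrays of equal order. This is an
illustrative sketch only: in an actual layer the factors would be produced from the preceding
layer, whereas here they are random, and the name `ktp_output` and the chosen shapes are
hypothetical.

```python
import numpy as np

def ktp_output(factor_pairs):
    """Sum of Kronecker tensor products kron(A, B) over a list of (A, B) pairs."""
    return sum(np.kron(A, B) for A, B in factor_pairs)

rng = np.random.default_rng(0)
C, H, W = 3, 32, 32
K = 2  # number of components per factor shape
# Two different factor shapes (J = 2) that both produce a C x H x W output.
shapes = [((3, 4, 4), (1, 8, 8)), ((1, 8, 8), (3, 4, 4))]
pairs = [(rng.standard_normal(sa), rng.standard_normal(sb))
         for sa, sb in shapes for _ in range(K)]

T = ktp_output(pairs)
n_factor_values = sum(A.size + B.size for A, B in pairs)
```

In this toy configuration the factors hold 448 values in total, against 3072 entries of the output
tensor they generate.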


3.2 HKD layer 

In HKD layers, we approximate $T$ by the following multilinear map between
$A \in \mathbb{F}^{C_1 \times H_2 \times W_2}$ and
$B \in \mathbb{F}^{C \times C_1 \times H_1 \times W_1}$, which is a hybrid of Kronecker product and
dot product:

$$T_{c,h,w} = T_{c,\, h_1 + H_1 h_2,\, w_1 + W_1 w_2} = \sum_{c_1} A_{c_1, h_2, w_2} B_{c, c_1, h_1, w_1}, \tag{31}$$

where $H = H_1 H_2$ and $W = W_1 W_2$.

The rationale behind this construction is that the Kronecker product along the spatial dimensions
$H$ and $W$ may capture the spatial regularity of the output, which enforces low-rankness,
while the dot product along $C$ would allow combination of information from multiple channels
of $A$ and $B$.
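The hybrid map can be written as a single `einsum` followed by a reshape. The sketch below is an
assumption-laden illustration, not the authors' Theano code: it takes the left factor to have shape
(C1, H2, W2) and the right factor shape (C, C1, H1, W1), with output entry
T[c, h1 + H1*h2, w1 + W1*w2] equal to the sum over c1 of A[c1, h2, w2] * B[c, c1, h1, w1]; the
function name `hkd_output` is hypothetical.

```python
import numpy as np

def hkd_output(A, B):
    """Hybrid Kronecker-dot product of A (C1, H2, W2) and B (C, C1, H1, W1)."""
    C1, H2, W2 = A.shape
    C, C1_b, H1, W1 = B.shape
    assert C1 == C1_b
    # Result axes are (c, h2, h1, w2, w1): h2 and w2 are the slowly varying indices,
    # and the shared channel index k (= c1) is summed out (the dot product part).
    T = np.einsum("kab,ckij->caibj", A, B)
    return T.reshape(C, H2 * H1, W2 * W1)

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))      # C1 = 2, H2 = 3, W2 = 4
B = rng.standard_normal((3, 2, 5, 2))   # C = 3, C1 = 2, H1 = 5, W1 = 2
T = hkd_output(A, B)                    # shape (3, 15, 8)
```

Under this convention each output channel is a sum over c1 of ordinary matrix Kronecker products
`np.kron(A[c1], B[c, c1])`, which makes the Kronecker part along H and W explicit.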
In the framework of low-rank approximation, the formulation (31) is by no means unique. One
could for example improve the precision of the approximation by introducing multiple components
as:

$$T_{c,h,w} = T_{c,\, h_1 + H_1 h_2,\, w_1 + W_1 w_2} = \sum_{k} \sum_{c_1} A_{k, c_1, h_2, w_2} B_{k, c, c_1, h_1, w_1}. \tag{32}$$

Hereafter we refer to layers constructed following (31) as HKD layers. The general name of
MLM layers will refer to all possible kinds of layers that can be constructed from other kinds of
multilinear map.


3.3 General MLM layer 

A multilinear map is a function of several variables that is linear separately in each variable:

$$f : V_1 \times \cdots \times V_n \to W, \tag{33}$$

where $V_1, \ldots, V_n$ and $W$ are vector spaces with the following property: for each $i$, if all
of the variables but $v_i$ are held constant, then $f(v_1, \ldots, v_n)$ is a linear function of
$v_i$ (Lang, 2002). It is easy to verify that Kronecker product, convolution and dot product are all
special cases of multilinear maps.
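The defining property can be checked numerically for the Kronecker product. This is a small
NumPy sketch with arbitrary test matrices and scalars, included only to illustrate linearity in
each argument separately.

```python
import numpy as np

rng = np.random.default_rng(0)
A1, A2 = rng.standard_normal((2, 3)), rng.standard_normal((2, 3))
B1, B2 = rng.standard_normal((4, 5)), rng.standard_normal((4, 5))
alpha, beta = 2.0, -0.5

# Linearity in the left argument with the right argument held constant.
left_lhs = np.kron(alpha * A1 + beta * A2, B1)
left_rhs = alpha * np.kron(A1, B1) + beta * np.kron(A2, B1)

# Linearity in the right argument with the left argument held constant.
right_lhs = np.kron(A1, alpha * B1 + beta * B2)
right_rhs = alpha * np.kron(A1, B1) + beta * np.kron(A1, B2)

# The map is not linear jointly: scaling both arguments scales the output quadratically.
joint = np.kron(alpha * A1, alpha * B1)
```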

Figure 3 gives a schematic view of general structure of the MLM layer. The left factor and 
right factor are produced from the same input, which are later combined by the multilinear map 
to produce the output. Depending on the type of multilinear map used, the MLM layer will 
become HKD layer or KTP layer. We note that it is also possible to introduce more factors than 
two into an MLM layer. 


4. Empirical Evaluation of Multilinear Map Layer 

We next empirically study the properties and efficacy of the Multilinear Map layers and compare
them with the case when no structural constraint is imposed on the output.

To make a fair comparison, we first train a convolutional autoencoder with the output layer
being a fully-connected layer as a baseline. Then we replace the fully-connected layer with
different kinds of MLM layers and train the new network until the quality metrics stabilize. For
example, Figure 4 gives a subjective comparison of the HKD layer with the original model on the
SVHN dataset. We then compare the MLM layer method with the baseline model in terms of number of
parameters and prediction quality. We do the experiments based on an implementation of MLM
layers in the Theano (Bergstra et al., 2010, Bastien et al., 2012) framework.

Table 1 shows the performance of MLM layers on training an autoencoder for SVHN digit
reconstruction. The network first transforms the 40 x 40 input image to a bottleneck feature
vector through a traditional ConvNet consisting of 4 convolutional layers, 3 max pooling layers
and 1 fully connected layer. Then, the feature is transformed again to reconstruct the image
through 4 convolutional layers, 3 un-pooling layers and the last fully-connected layer or its
alternatives as output layer. The un-pooling operation is implemented with the same approach
used in (Dosovitskiy et al., 2014), by simply setting the pooled value at the top-left corner of
the pooled region and leaving the others as zero. The fourth column of the table is the number
of parameters in the fully-connected layer or its alternative MLM layer. By varying the length of
the bottleneck feature and using different alternatives for the FC layer, we can observe that both
HKD and KTP layers are good alternatives for the FC layer as output layer, and they also both
significantly reduce the number of parameters. We also tested the case with a convolutional layer
as output layer, and the result still shows the efficacy of the MLM layer.




Figure 3: Diagram of the structure of an MLM layer. The output of the preceding layer is fed
into two nodes where the left factor tensor and right factor tensor are computed. Then a
multilinear map is applied to the left and right factors to construct the output.
[Figure 4: Results of passing five cropped patches from the SVHN dataset as input images through
Autoencoders with different output layers. (a) Original image patches selected from the SVHN
dataset. (b) Results of an Autoencoder “A1”. (c) Results of an Autoencoder “A2” with fewer
neurons in the bottleneck layer than “A1”. (d) Results of an Autoencoder “A3” constructed from
“A2” by replacing the output FC layer with an HKD layer. It can be seen that “A3”, though with a
smaller number of hidden units in the bottleneck layer, visually performs better than “A1” and
“A2” in reconstructing the input images.]
Table 1: Evaluation of MLM layers on SVHN digit reconstruction

model                               bottleneck       total      layer      L2 error
                                    #hidden units    #params    #params
conv + FC                           512              13.33M     5764800    3.4e-2
conv + HKD                          512              4.97M      46788      5.2e-4
conv + HKD (multiple components)    512              5.05M      118642     3.6e-4
conv + FC                           64               13.40M     5764800    1.3e-1
conv + HKD                          64               5.04M      46788      1.8e-3
conv + KTP                          64               5.04M      46618      3.2e-3
conv                                16               5.09M      0          4.0e-2
conv + FC                           16               13.36M     5764800    8.8e-2
conv + HKD                          16               5.00M      46788      3.8e-3
conv + KTP                          16               5.00M      46618      5.8e-3


As the running time may depend on particular implementation details of the MLM layers and the
Theano framework, we do not report running time. However, the complexity analysis suggests
that there should be a significant reduction in the amount of computation.
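As a rough consistency check of Table 1 against the figures quoted in the abstract, the ratios
below are plain arithmetic on the reported numbers for the models with 16 bottleneck units
(conv + FC versus conv + HKD):

```latex
\frac{13.36\,\mathrm{M} - 5.00\,\mathrm{M}}{13.36\,\mathrm{M}} \approx 0.626,
\qquad
\frac{5764800}{46788} \approx 123.2,
\qquad
\frac{8.8 \times 10^{-2}}{3.8 \times 10^{-3}} \approx 23.2 .
```

These correspond to the roughly 62% reduction in the total number of parameters, a more than
hundredfold reduction in the number of output layer parameters, and an about 23-fold reduction in
L2 error.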

5. Related Work 

In this section we discuss related work not covered in previous sections. 

Low rank approximation has been a standard tool in restricting the space of the parameters.
Its application in linear regression dates back to (Anderson, 1951). In (Sainath et al., 2013, Liao
et al., 2013, Xue et al., 2013, Zhang et al., 2014b, Denton et al., 2014), low rank approximation
of fully-connected layers is used; and (Jaderberg et al., 2014, Rigamonti et al., 2013, Lebedev
et al., 2014b,a, Denil et al., 2013) also considered low rank approximation of convolution layers.
(Zhang et al., 2014a) considered approximation of multiple layers with nonlinear activations. To
the best of our knowledge, these methods only consider applying low-rank approximation to the
weights of the neural network, but not to the output of the neural network.

As structure is a general term, there are also other types of structure that exist in the desired
prediction. Neural networks with structured prediction also exist for tasks other than autoencoding,
like edge detection (Dollar and Zitnick, 2013), image segmentation (Zheng et al., 2015, Farabet
et al., 2013), super resolution (Dong et al., 2014a) and image generation (Dosovitskiy et al., 2014).
The structure in these problems may also exhibit low-rank structure exploitable by MLM layers.

6. Conclusion and Future Work 

In this paper, we propose and study methods for incorporating the low-rank image prior into the
predictions of the neural network. Instead of using regularization terms in the objective function
of the neural network, we directly encode the low-rank constraints as structural constraints by
requiring the output of the network to be the result of some kind of multilinear map. We consider
a few variants of multilinear map, including a hybrid-Kronecker-dot product and Kronecker
tensor product. We have found that using the MLM layer can significantly reduce the number
of parameters and the amount of computation for autoencoders on SVHN.

As future work, we note that when using the $\ell_1$ norm as objective together with the structural
constraint, we could effectively use the norm defined in Robust Principal Component Analysis as our
objective, which would be able to handle the sparse noise that may otherwise degrade the
low-rankness property of the predictions. In addition, it would be interesting to investigate
applying the structural constraints outlined in this paper to the output of intermediate layers of
neural networks.